As an urban planner tasked with improving pedestrian infrastructure and public transportation services, I need to understand the relationship between pedestrian activity peaks and the locations of bus stops within a given area. By analyzing these patterns, I aim to provide recommendations for optimizing bus stop locations, enhancing pedestrian facilities, and reducing congestion to promote a healthier urban environment.
At the end of this use case you will have practiced:
- Data manipulation and analysis using pandas.
- Advanced visualization techniques with matplotlib and geopandas for geospatial data analysis.
- Spatial analysis using geopandas to understand geographic relationships between pedestrian activity and bus stop locations.
- Implementation and interpretation of clustering algorithms for identifying spatial patterns in data.
- Integration of machine learning techniques (such as clustering) with spatial analysis to derive actionable insights for urban planning.
- Application of data-driven decision-making processes in urban planning contexts.
In urban planning, understanding the dynamics of pedestrian activity and public transportation infrastructure is crucial for creating sustainable and livable cities. This use case focuses on analyzing the correlation between pedestrian activity peaks and the locations of bus stops using Python libraries such as pandas, matplotlib, and geopandas. By examining these relationships, we aim to provide insights for urban planners to optimize infrastructure and enhance the urban environment.
Writing out the dependencies¶
# Dependencies
import warnings
warnings.filterwarnings("ignore")
import io
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from folium.plugins import HeatMap
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, r2_score
pd.set_option('display.max_columns', None)
# **Preferred Method**: Export Endpoint
from io import StringIO
def API_unlimited(dataset_id):
    # Dataset pages live on the City of Melbourne open data portal, e.g.
    # https://data.melbourne.vic.gov.au/explore/dataset/pedestrian-counting-system-monthly-counts-per-hour/information/
    base_url = 'https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'
    #apikey = api_key
    export_format = 'csv'
    url = f'{base_url}{dataset_id}/exports/{export_format}'
    params = {
        'select': '*',
        'limit': -1,  # -1 requests all records
        'lang': 'en',
        'timezone': 'UTC',
        # 'api_key': apikey
    }
    # GET request
    response = requests.get(url, params=params)
    if response.status_code == 200:
        # StringIO lets pandas read the CSV data straight from memory
        url_content = response.content.decode('utf-8')
        df = pd.read_csv(StringIO(url_content), delimiter=';')
        print(df.sample(10, random_state=999))  # Quick sanity check
        return df
    else:
        print(f'Request failed with status code {response.status_code}')
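When calling the export endpoint in practice, it can help to guard against hung connections and non-200 responses. Below is a hedged variant of the request step (the `fetch_csv_text` helper is my own naming, not part of the portal's API), assuming the same base URL and no API key:

```python
import requests

base_url = 'https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'
dataset_id = 'bus-stops'
url = f'{base_url}{dataset_id}/exports/csv'

def fetch_csv_text(url, params=None, timeout=60):
    """GET the export URL, raising on HTTP errors instead of returning None."""
    # timeout prevents the notebook hanging indefinitely on a stalled connection
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text

print(url)
```

The caller can then wrap `fetch_csv_text` in a try/except around `requests.HTTPError` rather than checking `status_code` manually.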
dataset_id_1 = 'pedestrian-counting-system-monthly-counts-per-hour'
dataset_id_2 = 'sensor-readings-with-temperature-light-humidity-every-5-minutes-at-8-locations-t'
dataset_id_3 = 'bus-stops'
pedestrian_hour = API_unlimited(dataset_id_1)
sensor_name timestamp locationid direction_1 \
299310 Col700_T 2023-06-11T00:00:00+00:00 9 72
273077 SprFli_T 2024-01-16T19:00:00+00:00 75 30
230538 Bou688_T 2023-08-22T21:00:00+00:00 58 794
545967 FLDegC_T 2024-03-12T00:00:00+00:00 69 211
41658 BouHbr_T 2023-07-06T06:00:00+00:00 10 275
311686 BouBri_T 2023-06-03T18:00:00+00:00 57 2
503749 WestWP_T 2024-02-09T08:00:00+00:00 71 10
362060 Col12_T 2023-12-08T17:00:00+00:00 18 471
90696 AG_T 2023-09-29T13:00:00+00:00 29 71
340770 ACMI_T 2023-08-06T06:00:00+00:00 72 352
direction_2 total_of_directions location
299310 93 165 -37.81982992, 144.95102555
273077 18 48 -37.81515276, 144.97467661
230538 127 921 -37.81686075, 144.95358075
545967 134 345 -37.81687226, 144.96559144
41658 61 336 -37.81876474, 144.94710545
311686 4 6 -37.8176735, 144.95025594
503749 4 14 -37.81235775, 144.97136962
362060 126 597 -37.81344862, 144.97305353
90696 94 165 -37.81965809, 144.96863453
340770 448 800 -37.81726338, 144.96872809
# View the pedestrian counts dataset
pedestrian_hour
| | sensor_name | timestamp | locationid | direction_1 | direction_2 | total_of_directions | location |
|---|---|---|---|---|---|---|---|
| 0 | SprFli_T | 2023-04-24T21:00:00+00:00 | 75 | 36 | 17 | 53 | -37.81515276, 144.97467661 |
| 1 | SprFli_T | 2023-04-25T00:00:00+00:00 | 75 | 28 | 50 | 78 | -37.81515276, 144.97467661 |
| 2 | SprFli_T | 2023-04-25T01:00:00+00:00 | 75 | 63 | 63 | 126 | -37.81515276, 144.97467661 |
| 3 | SprFli_T | 2023-04-25T02:00:00+00:00 | 75 | 85 | 89 | 174 | -37.81515276, 144.97467661 |
| 4 | SprFli_T | 2023-04-25T08:00:00+00:00 | 75 | 365 | 59 | 424 | -37.81515276, 144.97467661 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 549971 | 474Fl_T | 2024-03-18T14:00:00+00:00 | 141 | 5 | 15 | 20 | -37.81997273, 144.95834911 |
| 549972 | Hammer1584_T | 2024-03-18T02:00:00+00:00 | 142 | 349 | 285 | 634 | -37.81970749, 144.96795734 |
| 549973 | Hammer1584_T | 2024-03-18T03:00:00+00:00 | 142 | 282 | 173 | 455 | -37.81970749, 144.96795734 |
| 549974 | Hammer1584_T | 2024-03-18T08:00:00+00:00 | 142 | 260 | 261 | 521 | -37.81970749, 144.96795734 |
| 549975 | Hammer1584_T | 2024-03-18T10:00:00+00:00 | 142 | 157 | 146 | 303 | -37.81970749, 144.96795734 |
549976 rows × 7 columns
Fetching the sensor reading dataset¶
sensor_reading = API_unlimited(dataset_id_2)
timestamp mac boardtype boardid \
16436 2015-02-06T07:20:00+00:00 0013a20040b31583 1 509
41268 2014-12-16T20:40:00+00:00 0013a20040b5b318 1 502
15000 2014-12-28T15:30:00+00:00 0013a20040b31583 1 509
26069 2015-03-20T02:20:00+00:00 0013a20040b516f6 1 507
49093 2015-03-09T19:55:00+00:00 0013a20040b31571 1 510
19453 2015-01-15T07:55:00+00:00 0013a20040b31583 1 509
50144 2015-03-17T03:30:00+00:00 0013a20040b31583 1 509
15189 2014-12-30T03:50:00+00:00 0013a20040b5b337 1 511
34161 2015-04-23T23:35:00+00:00 0013a20040b31583 1 509
1324 2015-02-08T10:55:00+00:00 0013a20040b31571 1 510
temp_max temp_min temp_avg light_max light_min light_avg \
16436 38.4 38.4 38.4 97.8 97.8 97.8
41268 17.7 17.7 17.7 94.3 94.3 94.3
15000 18.4 18.4 18.4 3.7 3.7 3.7
26069 17.1 17.1 17.1 91.7 91.7 91.7
49093 15.5 15.5 15.5 6.9 6.9 6.9
19453 19.0 19.0 19.0 93.4 93.4 93.4
50144 25.8 25.8 25.8 95.4 95.4 95.4
15189 23.9 23.9 23.9 97.8 97.8 97.8
34161 15.2 15.2 15.2 2.3 2.3 2.3
1324 17.7 17.7 17.7 1.6 1.6 1.6
humidity_min humidity_max humidity_avg model latitude longitude \
16436 15.0 15.0 15.0 ENV -37.819904 144.940485
41268 37.6 37.6 37.6 ENV -37.814610 144.979018
15000 62.0 62.0 62.0 ENV -37.819904 144.940485
26069 39.6 39.6 39.6 ENV -37.814922 144.982258
49093 47.5 47.5 47.5 ENV -37.819712 144.941325
19453 56.5 56.5 56.5 ENV -37.819904 144.940485
50144 30.2 30.2 30.2 ENV -37.819904 144.940485
15189 24.8 24.8 24.8 ENV -37.819500 144.941889
34161 67.8 67.8 67.8 ENV -37.819904 144.940485
1324 54.3 54.3 54.3 ENV -37.819712 144.941325
elevation location rowid position \
16436 NaN Docklands Library 509-20150206072000 NaN
41268 22.57 Fitzroy Gardens 502-20141216204000 NaN
15000 NaN Docklands Library 509-20141228153000 NaN
26069 38.79 Fitzroy Gardens 507-20150320022000 NaN
49093 2.74 Docklands Library 510-20150309195500 NaN
19453 NaN Docklands Library 509-20150115075500 NaN
50144 0.03 Docklands Library 509-20150317033000 NaN
15189 NaN Docklands Library 511-20141230035000 NaN
34161 0.03 Docklands Library 509-20150423233500 NaN
1324 NaN Docklands Library 510-20150208105500 NaN
geolocation
16436 -37.8199043, 144.9404851
41268 -37.8146097, 144.9790177
15000 -37.8199043, 144.9404851
26069 -37.8149218, 144.9822582
49093 -37.8197121, 144.9413253
19453 -37.8199043, 144.9404851
50144 -37.8199043, 144.9404851
15189 -37.8195002, 144.9418888
34161 -37.8199043, 144.9404851
1324 -37.8197121, 144.9413253
# Viewing the sensor reading dataset
sensor_reading.head()
| | timestamp | mac | boardtype | boardid | temp_max | temp_min | temp_avg | light_max | light_min | light_avg | humidity_min | humidity_max | humidity_avg | model | latitude | longitude | elevation | location | rowid | position | geolocation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-01-24T10:45:00+00:00 | 0013a20040b31571 | 1 | 510 | 19.4 | 19.4 | 19.4 | 0.9 | 0.9 | 0.9 | 52.7 | 52.7 | 52.7 | ENV | -37.819712 | 144.941325 | NaN | Docklands Library | 510-20150124104500 | NaN | -37.8197121, 144.9413253 |
| 1 | 2015-01-24T11:15:00+00:00 | 0013a20040b5b337 | 1 | 511 | 19.7 | 19.7 | 19.7 | 10.6 | 10.6 | 10.6 | 50.2 | 50.2 | 50.2 | ENV | -37.819500 | 144.941889 | NaN | Docklands Library | 511-20150124111500 | NaN | -37.8195002, 144.9418888 |
| 2 | 2015-01-24T11:15:00+00:00 | 0013a20040b31583 | 1 | 509 | 19.7 | 19.7 | 19.7 | 3.1 | 3.1 | 3.1 | 57.9 | 57.9 | 57.9 | ENV | -37.819904 | 144.940485 | NaN | Docklands Library | 509-20150124111500 | NaN | -37.8199043, 144.9404851 |
| 3 | 2015-01-24T11:55:00+00:00 | 0013a20040b31583 | 1 | 509 | 19.7 | 19.7 | 19.7 | 3.1 | 3.1 | 3.1 | 53.7 | 53.7 | 53.7 | ENV | -37.819904 | 144.940485 | NaN | Docklands Library | 509-20150124115500 | NaN | -37.8199043, 144.9404851 |
| 4 | 2015-01-24T11:55:00+00:00 | 0013a20040b31571 | 1 | 510 | 18.7 | 18.7 | 18.7 | 1.0 | 1.0 | 1.0 | 48.6 | 48.6 | 48.6 | ENV | -37.819712 | 144.941325 | NaN | Docklands Library | 510-20150124115500 | NaN | -37.8197121, 144.9413253 |
# Checking the shape
sensor_reading.shape
(56570, 21)
Fetching the bus stop dataset¶
bus_stop = API_unlimited(dataset_id_3)
geo_point_2d \
293 -37.78737016259562, 144.96918092237397
8 -37.837547087144706, 144.98191138368836
30 -37.82480198399865, 144.97076232908503
308 -37.818314889062094, 144.956839508202
289 -37.81105987177411, 144.95869339408225
109 -37.78077459328419, 144.95138857277198
45 -37.79443959174042, 144.9295031556217
243 -37.803343440196116, 144.9693670992385
273 -37.80282843793904, 144.9479395483275
135 -37.80111524772101, 144.96674878780823
geo_shape prop_id addresspt1 \
293 {"coordinates": [144.96918092237397, -37.78737... 0 0.000000
8 {"coordinates": [144.98191138368836, -37.83754... 0 41.441167
30 {"coordinates": [144.97076232908503, -37.82480... 0 26.353383
308 {"coordinates": [144.956839508202, -37.8183148... 0 35.877984
289 {"coordinates": [144.95869339408225, -37.81105... 0 31.787580
109 {"coordinates": [144.95138857277198, -37.78077... 107426 55.825150
45 {"coordinates": [144.9295031556217, -37.794439... 0 2.826674
243 {"coordinates": [144.9693670992385, -37.803343... 0 10.914450
273 {"coordinates": [144.9479395483275, -37.802828... 0 13.532624
135 {"coordinates": [144.96674878780823, -37.80111... 0 5.228496
addressp_1 asset_clas asset_type objectid str_id \
293 0 Signage Sign - Public Transport 39748 1252536
8 78 Signage Sign - Public Transport 2922 1248743
30 200 Signage Sign - Public Transport 15210 1239404
308 285 Signage Sign - Public Transport 44101 1268402
289 239 Signage Sign - Public Transport 36816 1252743
109 306 Signage Sign - Public Transport 7176 1244570
45 299 Signage Sign - Public Transport 23024 1577042
243 117 Signage Sign - Public Transport 16758 1240396
273 123 Signage Sign - Public Transport 29192 1251207
135 179 Signage Sign - Public Transport 17860 1240314
addresspt asset_subt model_desc mcc_id \
293 0 NaN Sign - Public Transport 1 Panel 1252536
8 107419 NaN Sign - Public Transport 1 Panel 1248743
30 540076 NaN Sign - Public Transport 1 Panel 1239404
308 105393 NaN Sign - Public Transport 1 Panel 1268402
289 577288 NaN Sign - Public Transport 1 Panel 1252743
109 111342 NaN Sign - Public Transport 1 Panel 1244570
45 106320 NaN Sign - Public Transport 1 Panel 1577042
243 612989 NaN Sign - Public Transport 1 Panel 1240396
273 107985 NaN Sign - Public Transport 1 Panel 1251207
135 106109 NaN Sign - Public Transport 1 Panel 1240314
roadseg_id descriptio model_no
293 22508 Sign - Public Transport 1 Panel Bus Stop Type 8 P.16
8 22245 Sign - Public Transport 1 Panel Bus Stop Type 8 P.16
30 22466 Sign - Public Transport 1 Panel Bus Stop Type 8 P.16
308 20118 Sign - Public Transport 1 Panel Bus Stop Type 8 P.16
289 20026 Sign - Public Transport 1 Panel Bus Stop Type 8 P.16
109 0 Sign - Public Transport 1 Panel Bus Stop Type 8 P.16
45 21693 Sign - Public Transport 1 Panel Bus Stop Type 1 P.16
243 20556 Sign - Public Transport 1 Panel Bus Stop Type 3 P.16
273 21015 Sign - Public Transport 1 Panel Bus Stop Type 8 P.16
135 20530 Sign - Public Transport 1 Panel Bus Stop Type 3 P.16
# Viewing the bus stops dataset
bus_stop.head(3)
| | geo_point_2d | geo_shape | prop_id | addresspt1 | addressp_1 | asset_clas | asset_type | objectid | str_id | addresspt | asset_subt | model_desc | mcc_id | roadseg_id | descriptio | model_no |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -37.80384165792465, 144.93239283833262 | {"coordinates": [144.93239283833262, -37.80384... | 0 | 76.819824 | 357 | Signage | Sign - Public Transport | 355 | 1235255 | 570648 | NaN | Sign - Public Transport 1 Panel | 1235255 | 21673 | Sign - Public Transport 1 Panel Bus Stop Type 13 | P.16 |
| 1 | -37.81548699581418, 144.9581794249902 | {"coordinates": [144.9581794249902, -37.815486... | 0 | 21.561304 | 83 | Signage | Sign - Public Transport | 600 | 1231226 | 548056 | NaN | Sign - Public Transport 1 Panel | 1231226 | 20184 | Sign - Public Transport 1 Panel Bus Stop Type 8 | P.16 |
| 2 | -37.81353897396532, 144.95728334230756 | {"coordinates": [144.95728334230756, -37.81353... | 0 | 42.177187 | 207 | Signage | Sign - Public Transport | 640 | 1237092 | 543382 | NaN | Sign - Public Transport 1 Panel | 1237092 | 20186 | Sign - Public Transport 1 Panel Bus Stop Type 8 | P.16 |
#The shape
bus_stop.shape
(309, 16)
pedestrian_hour.head()
| | sensor_name | timestamp | locationid | direction_1 | direction_2 | total_of_directions | location |
|---|---|---|---|---|---|---|---|
| 0 | SprFli_T | 2023-04-24T21:00:00+00:00 | 75 | 36 | 17 | 53 | -37.81515276, 144.97467661 |
| 1 | SprFli_T | 2023-04-25T00:00:00+00:00 | 75 | 28 | 50 | 78 | -37.81515276, 144.97467661 |
| 2 | SprFli_T | 2023-04-25T01:00:00+00:00 | 75 | 63 | 63 | 126 | -37.81515276, 144.97467661 |
| 3 | SprFli_T | 2023-04-25T02:00:00+00:00 | 75 | 85 | 89 | 174 | -37.81515276, 144.97467661 |
| 4 | SprFli_T | 2023-04-25T08:00:00+00:00 | 75 | 365 | 59 | 424 | -37.81515276, 144.97467661 |
Working on the pedestrian count dataset¶
# Trying to get the number of pedestrian records for each location and hour
pedestrian_counts = {}
# Iterate through each row in the DataFrame
for index, row in pedestrian_hour.iterrows():
    # Extract location ID, location, and timestamp from the row
    location_id = row['locationid']
    location = row['location']
    timestamp = row['timestamp']
    count = 1  # Each row represents one hourly sensor record
    # Create a unique key from location ID, location, and timestamp
    key = (location_id, location, timestamp)
    # Increment the count for the key
    if key in pedestrian_counts:
        pedestrian_counts[key] += count
    else:
        pedestrian_counts[key] = count
# Create lists to store the aggregated data
location_ids = []
locations = []
timestamps = []
counts = []
# Iterate through the pedestrian_counts dictionary and append data to the lists
for (location_id, location, timestamp), count in pedestrian_counts.items():
    location_ids.append(location_id)
    locations.append(location)
    timestamps.append(timestamp)
    counts.append(count)
# Create a new DataFrame to store the aggregated counts
ped_df = pd.DataFrame({
    'locationid': location_ids,
    'location': locations,
    'timestamp': timestamps,
    'pedestrian_count': counts
})
# View the new DataFrame
ped_df.head()
| | locationid | location | timestamp | pedestrian_count |
|---|---|---|---|---|
| 0 | 75 | -37.81515276, 144.97467661 | 2023-04-24T21:00:00+00:00 | 1 |
| 1 | 75 | -37.81515276, 144.97467661 | 2023-04-25T00:00:00+00:00 | 1 |
| 2 | 75 | -37.81515276, 144.97467661 | 2023-04-25T01:00:00+00:00 | 1 |
| 3 | 75 | -37.81515276, 144.97467661 | 2023-04-25T02:00:00+00:00 | 1 |
| 4 | 75 | -37.81515276, 144.97467661 | 2023-04-25T08:00:00+00:00 | 1 |
ped_df.shape
(549967, 4)
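The row-by-row dictionary loop above can be expressed more concisely (and far faster) with a pandas `groupby`; here is a minimal sketch on a toy stand-in for `pedestrian_hour`, assuming the same three key columns:

```python
import pandas as pd

# Toy stand-in for pedestrian_hour: two records share the same key
toy = pd.DataFrame({
    'locationid': [75, 75, 9],
    'location': ['-37.815, 144.974', '-37.815, 144.974', '-37.819, 144.951'],
    'timestamp': ['2023-04-24T21:00:00', '2023-04-24T21:00:00', '2023-06-11T00:00:00'],
})

# size() counts rows per (locationid, location, timestamp) key,
# which is exactly what the dictionary loop accumulates
ped_counts = (toy.groupby(['locationid', 'location', 'timestamp'])
                 .size()
                 .reset_index(name='pedestrian_count'))
print(ped_counts)
```

On the full 549,976-row dataset, avoiding `iterrows()` makes a substantial runtime difference.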
Working on the sensor reading dataset¶
sensor_reading.columns
Index(['timestamp', 'mac', 'boardtype', 'boardid', 'temp_max', 'temp_min',
'temp_avg', 'light_max', 'light_min', 'light_avg', 'humidity_min',
'humidity_max', 'humidity_avg', 'model', 'latitude', 'longitude',
'elevation', 'location', 'rowid', 'position', 'geolocation'],
dtype='object')
# Selecting the important columns for my analysis into a new DataFrame sensor_df
sensor_df = sensor_reading[['timestamp','temp_avg','light_avg','humidity_avg', 'geolocation']]
sensor_df.head()
| | timestamp | temp_avg | light_avg | humidity_avg | geolocation |
|---|---|---|---|---|---|
| 0 | 2015-01-24T10:45:00+00:00 | 19.4 | 0.9 | 52.7 | -37.8197121, 144.9413253 |
| 1 | 2015-01-24T11:15:00+00:00 | 19.7 | 10.6 | 50.2 | -37.8195002, 144.9418888 |
| 2 | 2015-01-24T11:15:00+00:00 | 19.7 | 3.1 | 57.9 | -37.8199043, 144.9404851 |
| 3 | 2015-01-24T11:55:00+00:00 | 19.7 | 3.1 | 53.7 | -37.8199043, 144.9404851 |
| 4 | 2015-01-24T11:55:00+00:00 | 18.7 | 1.0 | 48.6 | -37.8197121, 144.9413253 |
Working on the bus stops dataset¶
bus_stop.columns
Index(['geo_point_2d', 'geo_shape', 'prop_id', 'addresspt1', 'addressp_1',
'asset_clas', 'asset_type', 'objectid', 'str_id', 'addresspt',
'asset_subt', 'model_desc', 'mcc_id', 'roadseg_id', 'descriptio',
'model_no'],
dtype='object')
# Selecting the important columns for my analysis into a new DataFrame
b_stops = bus_stop[['geo_point_2d','str_id','roadseg_id']]
b_stops.head()
| | geo_point_2d | str_id | roadseg_id |
|---|---|---|---|
| 0 | -37.80384165792465, 144.93239283833262 | 1235255 | 21673 |
| 1 | -37.81548699581418, 144.9581794249902 | 1231226 | 20184 |
| 2 | -37.81353897396532, 144.95728334230756 | 1237092 | 20186 |
| 3 | -37.82191394843844, 144.95539345270072 | 1232777 | 22174 |
| 4 | -37.83316401267591, 144.97443745130263 | 1271914 | 22708 |
b_stops.shape
(309, 3)
Renaming the location column of the bus stops and sensor datasets to enable merging¶
# Renaming the location columns so all three datasets share a common 'location' key
new_sensor = sensor_df.rename(columns={'geolocation': 'location'})
new_bstops = b_stops.rename(columns={'geo_point_2d': 'location'})
new_sensor.head()
#new_bstops.head()
| | timestamp | temp_avg | light_avg | humidity_avg | location |
|---|---|---|---|---|---|
| 0 | 2015-01-24T10:45:00+00:00 | 19.4 | 0.9 | 52.7 | -37.8197121, 144.9413253 |
| 1 | 2015-01-24T11:15:00+00:00 | 19.7 | 10.6 | 50.2 | -37.8195002, 144.9418888 |
| 2 | 2015-01-24T11:15:00+00:00 | 19.7 | 3.1 | 57.9 | -37.8199043, 144.9404851 |
| 3 | 2015-01-24T11:55:00+00:00 | 19.7 | 3.1 | 53.7 | -37.8199043, 144.9404851 |
| 4 | 2015-01-24T11:55:00+00:00 | 18.7 | 1.0 | 48.6 | -37.8197121, 144.9413253 |
Merging the three processed datasets¶
# Splitting the location column in ped_df, new_bstops, and new_sensor into latitude and longitude
ped_df[['latitude', 'longitude']] = ped_df['location'].str.split(', ', expand=True)
new_bstops[['latitude', 'longitude']] = new_bstops['location'].str.split(', ', expand=True)
new_sensor[['latitude', 'longitude']] = new_sensor['location'].str.split(', ', expand=True)
new_bstops.head()
ped_df.head()
new_sensor.head()
| | timestamp | temp_avg | light_avg | humidity_avg | location | latitude | longitude |
|---|---|---|---|---|---|---|---|
| 0 | 2015-01-24T10:45:00+00:00 | 19.4 | 0.9 | 52.7 | -37.8197121, 144.9413253 | -37.8197121 | 144.9413253 |
| 1 | 2015-01-24T11:15:00+00:00 | 19.7 | 10.6 | 50.2 | -37.8195002, 144.9418888 | -37.8195002 | 144.9418888 |
| 2 | 2015-01-24T11:15:00+00:00 | 19.7 | 3.1 | 57.9 | -37.8199043, 144.9404851 | -37.8199043 | 144.9404851 |
| 3 | 2015-01-24T11:55:00+00:00 | 19.7 | 3.1 | 53.7 | -37.8199043, 144.9404851 | -37.8199043 | 144.9404851 |
| 4 | 2015-01-24T11:55:00+00:00 | 18.7 | 1.0 | 48.6 | -37.8197121, 144.9413253 | -37.8197121 | 144.9413253 |
# Converting the coordinates of the pedestrian dataset to float, rounded to 7 decimal places
ped_df[['latitude', 'longitude']] = ped_df[['latitude', 'longitude']].astype(float).round(7)
ped_df.head()
| | locationid | location | timestamp | pedestrian_count | latitude | longitude |
|---|---|---|---|---|---|---|
| 0 | 75 | -37.81515276, 144.97467661 | 2023-04-24T21:00:00+00:00 | 1 | -37.815153 | 144.974677 |
| 1 | 75 | -37.81515276, 144.97467661 | 2023-04-25T00:00:00+00:00 | 1 | -37.815153 | 144.974677 |
| 2 | 75 | -37.81515276, 144.97467661 | 2023-04-25T01:00:00+00:00 | 1 | -37.815153 | 144.974677 |
| 3 | 75 | -37.81515276, 144.97467661 | 2023-04-25T02:00:00+00:00 | 1 | -37.815153 | 144.974677 |
| 4 | 75 | -37.81515276, 144.97467661 | 2023-04-25T08:00:00+00:00 | 1 | -37.815153 | 144.974677 |
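The split, cast, and round steps can also be chained in a single expression; here is a small sketch on a toy frame, assuming the `'lat, lon'` string format used throughout:

```python
import pandas as pd

toy = pd.DataFrame({'location': ['-37.81515276, 144.97467661']})
# split on ', ', cast both pieces to float, and round in one chain
toy[['latitude', 'longitude']] = (toy['location']
                                  .str.split(', ', expand=True)
                                  .astype(float)
                                  .round(7))
print(toy[['latitude', 'longitude']])
```

`DataFrame.round` is vectorized, so it replaces the nested `apply(lambda ...)` pattern without changing the result.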
# Converting the coordinates of the bus stops dataset to float, rounded to 7 decimal places
new_bstops[['latitude', 'longitude']] = new_bstops[['latitude', 'longitude']].astype(float).round(7)
new_bstops.head()
| | location | str_id | roadseg_id | latitude | longitude |
|---|---|---|---|---|---|
| 0 | -37.80384165792465, 144.93239283833262 | 1235255 | 21673 | -37.803842 | 144.932393 |
| 1 | -37.81548699581418, 144.9581794249902 | 1231226 | 20184 | -37.815487 | 144.958179 |
| 2 | -37.81353897396532, 144.95728334230756 | 1237092 | 20186 | -37.813539 | 144.957283 |
| 3 | -37.82191394843844, 144.95539345270072 | 1232777 | 22174 | -37.821914 | 144.955394 |
| 4 | -37.83316401267591, 144.97443745130263 | 1271914 | 22708 | -37.833164 | 144.974437 |
# Converting the coordinates of the sensor dataset to float, rounded to 7 decimal places
new_sensor[['latitude', 'longitude']] = new_sensor[['latitude', 'longitude']].astype(float).round(7)
new_sensor.head()
| | timestamp | temp_avg | light_avg | humidity_avg | location | latitude | longitude |
|---|---|---|---|---|---|---|---|
| 0 | 2015-01-24T10:45:00+00:00 | 19.4 | 0.9 | 52.7 | -37.8197121, 144.9413253 | -37.819712 | 144.941325 |
| 1 | 2015-01-24T11:15:00+00:00 | 19.7 | 10.6 | 50.2 | -37.8195002, 144.9418888 | -37.819500 | 144.941889 |
| 2 | 2015-01-24T11:15:00+00:00 | 19.7 | 3.1 | 57.9 | -37.8199043, 144.9404851 | -37.819904 | 144.940485 |
| 3 | 2015-01-24T11:55:00+00:00 | 19.7 | 3.1 | 53.7 | -37.8199043, 144.9404851 | -37.819904 | 144.940485 |
| 4 | 2015-01-24T11:55:00+00:00 | 18.7 | 1.0 | 48.6 | -37.8197121, 144.9413253 | -37.819712 | 144.941325 |
# Merging the datasets by stacking them row-wise
new_df = pd.concat([ped_df, new_sensor, new_bstops])
new_df.head()
| | locationid | location | timestamp | pedestrian_count | latitude | longitude | temp_avg | light_avg | humidity_avg | str_id | roadseg_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | -37.81515276, 144.97467661 | 2023-04-24T21:00:00+00:00 | 1.0 | -37.815153 | 144.974677 | NaN | NaN | NaN | NaN | NaN |
| 1 | 75.0 | -37.81515276, 144.97467661 | 2023-04-25T00:00:00+00:00 | 1.0 | -37.815153 | 144.974677 | NaN | NaN | NaN | NaN | NaN |
| 2 | 75.0 | -37.81515276, 144.97467661 | 2023-04-25T01:00:00+00:00 | 1.0 | -37.815153 | 144.974677 | NaN | NaN | NaN | NaN | NaN |
| 3 | 75.0 | -37.81515276, 144.97467661 | 2023-04-25T02:00:00+00:00 | 1.0 | -37.815153 | 144.974677 | NaN | NaN | NaN | NaN | NaN |
| 4 | 75.0 | -37.81515276, 144.97467661 | 2023-04-25T08:00:00+00:00 | 1.0 | -37.815153 | 144.974677 | NaN | NaN | NaN | NaN | NaN |
# Checking the percentage of missing values in each column
missing_percent = new_df.isna().sum().sort_values()/len(new_df)*100
missing_percent
location             0.000000
latitude             0.000000
longitude            0.000000
timestamp            0.050919
locationid           9.372889
pedestrian_count     9.372889
temp_avg            90.678030
light_avg           90.678030
humidity_avg        90.678030
str_id              99.949081
roadseg_id          99.949081
dtype: float64
plt.figure(figsize=(7, 3))
missing_percent[missing_percent >= 5].plot.bar()
plt.gca().set_xlabel("columns")
plt.gca().set_ylabel("percentage")
plt.gca().set_title("Percentage of missing values")
plt.grid()
plt.show()
After merging the three datasets, many of the important columns ended up full of NaN values, which I believe is not good for the analysis. I therefore decided to treat the datasets one after the other to derive different insights and patterns, before generating samples for clustering and decision making.
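One way to act on those missing-value percentages is to drop columns beyond a threshold before deeper analysis; here is a minimal sketch, assuming a 90% cutoff (the threshold itself is a judgment call):

```python
import pandas as pd
import numpy as np

# Toy frame with one mostly-complete column and one fully-missing column
toy = pd.DataFrame({
    'location': ['a', 'b', 'c', 'd'],
    'pedestrian_count': [1, 2, np.nan, 4],
    'str_id': [np.nan, np.nan, np.nan, np.nan],
})
missing_percent = toy.isna().mean() * 100
# Keep only columns with less than 90% missing values
kept = toy.loc[:, missing_percent < 90]
print(list(kept.columns))
```

Applied to `new_df`, this would drop `str_id` and `roadseg_id` (and likely the sensor averages), which matches the decision to analyze the datasets separately instead.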
# From the pedestrian dataset
ped_df.head()
| | locationid | location | timestamp | pedestrian_count | latitude | longitude |
|---|---|---|---|---|---|---|
| 0 | 75 | -37.81515276, 144.97467661 | 2023-04-24T21:00:00+00:00 | 1 | -37.815153 | 144.974677 |
| 1 | 75 | -37.81515276, 144.97467661 | 2023-04-25T00:00:00+00:00 | 1 | -37.815153 | 144.974677 |
| 2 | 75 | -37.81515276, 144.97467661 | 2023-04-25T01:00:00+00:00 | 1 | -37.815153 | 144.974677 |
| 3 | 75 | -37.81515276, 144.97467661 | 2023-04-25T02:00:00+00:00 | 1 | -37.815153 | 144.974677 |
| 4 | 75 | -37.81515276, 144.97467661 | 2023-04-25T08:00:00+00:00 | 1 | -37.815153 | 144.974677 |
ped_df['timestamp'] = pd.to_datetime(ped_df['timestamp'])
ped_df.head()
| | locationid | location | timestamp | pedestrian_count | latitude | longitude |
|---|---|---|---|---|---|---|
| 0 | 75 | -37.81515276, 144.97467661 | 2023-04-24 21:00:00+00:00 | 1 | -37.815153 | 144.974677 |
| 1 | 75 | -37.81515276, 144.97467661 | 2023-04-25 00:00:00+00:00 | 1 | -37.815153 | 144.974677 |
| 2 | 75 | -37.81515276, 144.97467661 | 2023-04-25 01:00:00+00:00 | 1 | -37.815153 | 144.974677 |
| 3 | 75 | -37.81515276, 144.97467661 | 2023-04-25 02:00:00+00:00 | 1 | -37.815153 | 144.974677 |
| 4 | 75 | -37.81515276, 144.97467661 | 2023-04-25 08:00:00+00:00 | 1 | -37.815153 | 144.974677 |
Getting the hourly count of pedestrians¶
ped_df['date'] = ped_df['timestamp'].dt.date
#ped_daily = ped_df.groupby('date')['pedestrian_count'].sum()
ped_df['hourly'] = ped_df['timestamp'].dt.floor('h')
ped_daily = ped_df.groupby('date', as_index=False)['pedestrian_count'].sum()
ped_hourly = ped_df.groupby('hourly', as_index=False)['pedestrian_count'].sum()
ped_hourly
| | hourly | pedestrian_count |
|---|---|---|
| 0 | 2023-03-31 13:00:00+00:00 | 71 |
| 1 | 2023-03-31 14:00:00+00:00 | 69 |
| 2 | 2023-03-31 15:00:00+00:00 | 68 |
| 3 | 2023-03-31 16:00:00+00:00 | 63 |
| 4 | 2023-03-31 17:00:00+00:00 | 64 |
| ... | ... | ... |
| 7510 | 2024-03-18 10:00:00+00:00 | 81 |
| 7511 | 2024-03-18 11:00:00+00:00 | 80 |
| 7512 | 2024-03-18 12:00:00+00:00 | 79 |
| 7513 | 2024-03-18 13:00:00+00:00 | 78 |
| 7514 | 2024-03-18 14:00:00+00:00 | 72 |
7515 rows × 2 columns
ped_daily
| | date | pedestrian_count |
|---|---|---|
| 0 | 2023-03-31 | 751 |
| 1 | 2023-04-01 | 1592 |
| 2 | 2023-04-02 | 1651 |
| 3 | 2023-04-03 | 1681 |
| 4 | 2023-04-04 | 1665 |
| ... | ... | ... |
| 341 | 2024-03-14 | 1930 |
| 342 | 2024-03-15 | 1935 |
| 343 | 2024-03-16 | 1945 |
| 344 | 2024-03-17 | 1904 |
| 345 | 2024-03-18 | 1216 |
346 rows × 2 columns
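An equivalent way to build the hourly series is resampling on a timestamp index; here is a sketch on toy data. Note that, unlike floor + groupby, `resample` also emits empty hourly bins with a count of 0:

```python
import pandas as pd

toy = pd.DataFrame({
    'timestamp': pd.to_datetime(['2023-04-25 00:10', '2023-04-25 00:50', '2023-04-25 02:00']),
    'pedestrian_count': [1, 1, 1],
})
# resample('h') buckets rows into hourly bins and sums the counts,
# mirroring dt.floor('h') + groupby but keeping gaps as zeros
hourly = toy.set_index('timestamp')['pedestrian_count'].resample('h').sum()
print(hourly)
```

Keeping the empty bins can matter for time-series plots, where missing hours would otherwise be silently skipped on the x-axis.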
Plotting the daily and hourly pedestrian count time series¶
plt.figure(figsize=(15,6))
# Subplot 1: Pedestrian counts over time (daily)
plt.subplot(1, 2, 1) # 1 row, 2 columns, 1st subplot
plt.plot(ped_daily['date'], ped_daily['pedestrian_count'], marker="o", linestyle="-")
plt.title('Pedestrian Count Over Time (Daily)')
plt.xlabel('Date')
plt.ylabel('Pedestrian Count')
plt.grid(True)
plt.xticks(rotation=45)
# Ensuring tight layout for the first subplot
plt.tight_layout()
# Subplot 2: Pedestrian counts over time (hourly)
plt.subplot(1, 2, 2) # 1 row, 2 columns, 2nd subplot
plt.plot(ped_hourly['hourly'], ped_hourly['pedestrian_count'], marker='o', linestyle="-")
plt.title('Pedestrian Count Over Time (Hourly)')
plt.xlabel('Hour')
plt.ylabel('Pedestrian Count')
plt.grid(True)
plt.xticks(rotation=45)
# Ensuring tight layout takes into account the second subplot as well
plt.tight_layout()
# Display the figure with both subplots
plt.show()
Peak Activity: A notable pattern observed in the data is the occurrence of days with significantly high pedestrian counts, ranging between 1750 and 2000. These peaks suggest periods of increased pedestrian activity that could be attributed to specific factors such as weekday rush hours, public events, or favorable weather conditions.
Low Activity: Conversely, the data also revealed days with markedly low pedestrian counts, approximately around 250. These troughs in pedestrian activity may indicate periods of reduced foot traffic, possibly due to adverse weather conditions, weekdays with no significant events, or other deterrents to outdoor activity.
ped_df['days_of_week'] = ped_df['timestamp'].dt.day_name()
count_by_day = ped_df.groupby('days_of_week',as_index=False)['pedestrian_count'].sum()
Finding the day of the week with the highest pedestrian traffic¶
count_by_day
| | days_of_week | pedestrian_count |
|---|---|---|
| 0 | Friday | 75574 |
| 1 | Monday | 79832 |
| 2 | Saturday | 80036 |
| 3 | Sunday | 78689 |
| 4 | Thursday | 76348 |
| 5 | Tuesday | 79470 |
| 6 | Wednesday | 80027 |
# Putting the order of the days in a variable
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
ped_df['days_of_week'] = pd.Categorical(ped_df['timestamp'].dt.day_name(), categories=day_order, ordered=True)
# Group by 'days_of_week' and sum 'pedestrian_count'
count_by_day = ped_df.groupby('days_of_week', as_index=False)['pedestrian_count'].sum()
# Plotting
plt.figure(figsize=(8, 5))
# When plotting, specify 'x' and 'y' explicitly to ensure correct columns are used
plt.bar(count_by_day['days_of_week'], count_by_day['pedestrian_count'], color='skyblue')
plt.ylim(50000, 100000)
plt.yticks(np.arange(50000, 105000, 5000))
plt.title('Total Pedestrian Count by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Total Pedestrian Count')  # Total, since we are summing counts
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--')
plt.show()
Peak Pedestrian Days: Monday, Wednesday, and Saturday were identified as the days with the highest pedestrian traffic. These days exhibit significantly higher pedestrian volumes compared to other days of the week, indicating a pattern of increased activity that could be attributed to specific weekly events, market days, or leisure activities typical of these days.
Lowest Pedestrian Traffic: Friday was observed to have the lowest pedestrian count among all days of the week. This reduction in pedestrian activity could reflect a weekly variation in social or commercial patterns, such as alternative entertainment options, travel patterns, or shopping behaviors.
Identifying significant concentrations of pedestrian traffic in specific areas¶
location_totals = ped_df.groupby('location')['pedestrian_count'].sum().reset_index()
location_totals
| | location | pedestrian_count |
|---|---|---|
| 0 | -37.79432415, 144.92973378 | 6682 |
| 1 | -37.79453803, 144.93036194 | 6738 |
| 2 | -37.79690473, 144.96440378 | 5007 |
| 3 | -37.79808192, 144.96721013 | 6854 |
| 4 | -37.79844526, 144.96411782 | 4293 |
| ... | ... | ... |
| 87 | -37.8204637, 144.94126826 | 1780 |
| 88 | -37.82129925, 144.96879309 | 7297 |
| 89 | -37.82293543, 144.9471751 | 7190 |
| 90 | -37.82401776, 144.95604426 | 4414 |
| 91 | -37.82590962, 144.96185972 | 1546 |
92 rows × 2 columns
# Sort by total counts (or use 'location_averages' if working with averages)
location_totals_sort = location_totals.sort_values(by='pedestrian_count', ascending=False)
top_15_location = location_totals_sort.head(15)
plt.figure(figsize=(10, 6))
sns.barplot(x='pedestrian_count', y='location', data=top_15_location, palette='viridis')
plt.title('Top 15 Locations by Total Pedestrian Traffic')
plt.xlabel('Total Pedestrian Count')
plt.ylabel('Location')
plt.grid(axis='x', linestyle='--')
plt.show()
High Pedestrian Traffic Concentration: The graph reveals a significant concentration of pedestrian traffic in specific areas, with the top location registering the highest count. This suggests a strong preference or need for pedestrian access in these areas, potentially driven by commercial, recreational, or transit-related activities.
Variability Among Top Locations: There is noticeable variability in pedestrian counts among the top 15 locations. While the top locations show exceptionally high pedestrian traffic, there is a gradual decrease as we move down the list, indicating a steep drop-off in foot traffic outside of the most frequented areas.
Location Characteristics: The busiest locations likely share common characteristics that attract high pedestrian volumes, such as proximity to public transportation hubs, commercial districts, or key urban attractions
Mapping out the pedestrian traffic in different locations¶
# Prepare data for the HeatMap
data = ped_df[['latitude', 'longitude', 'pedestrian_count']].values.tolist()
# Initialize your map around the center of your dataset
map_center = [ped_df['latitude'].mean(), ped_df['longitude'].mean()]
m = folium.Map(location=map_center, zoom_start=13)
# Add a HeatMap layer to the map
HeatMap(data).add_to(m)
# Display the map
m
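To connect pedestrian hotspots with bus stop placement, a natural next step is the distance from each pedestrian sensor to its nearest stop. Below is a hand-rolled haversine sketch with illustrative (made-up) coordinates; in practice geopandas' `sjoin_nearest` would do this at scale:

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    r = 6_371_000  # mean Earth radius in metres
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = np.radians(lat2 - lat1)
    dlmb = np.radians(lon2 - lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(dlmb / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

# Hypothetical sensor and two bus stops (coordinates are illustrative only)
sensor = (-37.81515, 144.97468)
stops = [(-37.81549, 144.95818), (-37.81354, 144.95728)]
dists = [haversine(sensor[0], sensor[1], lat, lon) for lat, lon in stops]
nearest = int(np.argmin(dists))
print(f'nearest stop index: {nearest}')
```

Summing `pedestrian_count` by sensor and joining it with these nearest-stop distances would directly quantify how well the existing stops cover the busiest pedestrian locations.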